Machine Learning Final Project

Authors
Affiliation

Adam Lodrik

University of Lausanne

Favre Stefan

Macaraeg Jeff

Published

May 14, 2024

Abstract

The following machine learning project focuses on…


1 Introduction

• The context and background: course, company name, business context.

During the first year of our Master's in Management (Business Analytics orientation), we attended the course Machine Learning for Business Analytics. The course covered a range of machine learning techniques applied to business contexts: supervised methods (regressions, trees, support vector machines, neural networks), unsupervised methods (clustering, PCA, FAMD, auto-encoders), and supporting topics such as data splitting, ensemble methods and metrics.

• Aim of the investigation: major terms should be defined, the question of research (more generally the issue), why it is of interest and relevant in that context.

As part of this course, our group worked on an applied project. Starting from scratch, we searched for a dataset on which we could apply what we had learned in class to a real case. We found a dataset describing vehicle MPG, range, engine statistics and more, covering more than 100 brands. The goal of our research is to predict the make (i.e. the brand) of a car from its characteristics (consumption, range, fuel type, …) using a trained model (random forest, artificial neural network or decision tree). Since cars from different brands can share several characteristics while differing on others, we considered it relevant to test whether a model can identify a car's brand from its features.

• Description of the data and the general material provided and how it was made available (and/or collected, if it is relevant). Only in broad terms however, the data will be further described in a following section. Typically, the origin/source of the data (the company, webpage, etc.), the type of files (Excel files, etc.), and what it contains in broad terms (e.g. “a file containing weekly sales with the factors of interest including in particular the promotion characteristics”).

The CSV dataset was found on data.world, a data catalog platform that gathers various open-access datasets. The file contains more than 45’000 rows and 26 columns, each column describing one feature (such as the model year, the model, the estimated annual petroleum consumption in barrels, the highway MPG per fuel type, and so on).

• The method that is used, in broad terms, no details needed at this point. E.g. “Model based machine learning will help us quantifying the important factors on the sales”.

Using these columns, we look for a machine learning model that can help us quantify the importance of each feature in predicting the make of the car. Several models will be tried, covering both supervised and unsupervised learning.

• An outlook: a short paragraph indicating from now what will be treated in each following sections/chapters. E.g. “in Section 3, we describe the data. Section 4 is dedicated to the presentation of the text mining methods…”

The rest of the report is organized as follows. Section 2 describes the data in more depth: the variables and features, the instances, the types of data and any missing-data patterns. The next section covers the Exploratory Data Analysis (EDA), where visualizations are used to reveal patterns in the variables as well as potential correlations. Section 4 presents the methods, first supervised and then unsupervised, with the aim of finding a suitable model for our project. The results are discussed next, followed by our conclusion, recommendations and discussion. Finally, the references and appendix are provided at the end of the report.

2 Data description

  • Description of the data file format (xlsx, csv, text, video, etc.) DONE
  • The features or variables: type, units, the range (e.g. the time, numerical, in weeks from January 1, 2012 to December 31, 2015), their coding (numerical, the levels for categorical, etc.), etc. TABLE-NTBF
  • The instances: customers, company, products, subjects, etc. DONE
  • Missing data pattern: if there are missing data, if they are specific to some features, etc. NTBD
  • Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc. NTBD
  • If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications. NTBD

For this project, we selected a dataset focused on vehicle characteristics, available as a .csv file from data.world. The dataset can be accessed via the following link: data.world. It includes a total of 26 features describing 45,896 vehicle entries released between 1984 and 2023. Below is a table providing an overview of the available features and their descriptions. A more detailed description of the data can be found in the Annex.

2.0.1 The features or variables: type, units,…

| Variable Name | Explanation |
|---|---|
| ID | Number corresponding to the precise combination of the features of the model |
| Model Year | Year of the model of the car |
| Make | The brand of the car |
| Model | The model of the car |
| Estimated Annual Petroleum Consumption (Barrels) | Consumption in petroleum barrels |
| Fuel Type 1 | First fuel energy source; the only source if the car is not a hybrid |
| City MPG (Fuel Type 1) | |
| Highway MPG (Fuel Type 1) | |
| Combined MPG (Fuel Type 1) | |
| Fuel Type 2 | Second energy source if the car is a hybrid |
| City MPG (Fuel Type 2) | |
| Highway MPG (Fuel Type 2) | |
| Combined MPG (Fuel Type 2) | |
| Engine Cylinders | From 2 to 16 cylinders |
| Engine Displacement | Measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers |
| Drive | |
| Engine Description | Description of the car, e.g. Turbo, Stop-Start, ... |
| Transmission | Manual/Automatic transmission, with number of gears and/or model of transmission |
| Vehicle Class | e.g. Minivan, Trucks, Midsize, ... |
| Time to Charge EV (hours at 120v) | |
| Time to Charge EV (hours at 240v) | |
| Range (for EV) | |
| City Range (for EV - Fuel Type 1) | |
| City Range (for EV - Fuel Type 2) | |
| Hwy Range (for EV - Fuel Type 1) | |
| Hwy Range (for EV - Fuel Type 2) | |
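The feature names above are written as in the CSV header, whereas the code later in this report uses spellings such as `City.MPG..Fuel.Type.1.`. The import step is not shown in the report, so the following is only a plausible explanation: base R's `read.csv()` with its default `check.names = TRUE` passes every header through `make.names()`. A minimal sketch of that conversion, using a handful of the names from the table:

```r
# Display names as they appear in the feature table.
display_names <- c(
  "Model Year",
  "City MPG (Fuel Type 1)",
  "Time to Charge EV (hours at 120v)",
  "City Range (for EV - Fuel Type 1)"
)

# make.names() replaces every character that is not valid in an R name with ".".
syntactic_names <- make.names(display_names)

# Lookup table to translate between the two spellings.
name_map <- data.frame(display = display_names, syntactic = syntactic_names)
print(name_map)
```

Keeping such a lookup table at hand makes it easier to match the skim output (display names) with the column names used in the plotting code.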

2.1 The instances: customers, company, products, subjects, etc.

Each row (instance) describes one car. A row starts with the ID, which identifies a precise combination of feature values, followed by the features listed in the table above.

2.2 Missing data pattern: if there are missing data, if they are specific to some features, etc.

2.3 Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc.

2.4 If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications.

EDA:

Columns description

To begin our EDA, let’s have a look at our dataset and in particular the characteristics of the columns.

#to get a detailed summary
skim(data)
Data summary

| Property | Value |
|---|---|
| Name | data |
| Number of rows | 45896 |
| Number of columns | 26 |
| Column type frequency: character | 8 |
| Column type frequency: numeric | 18 |
| Group variables | None |

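Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Make | 0 | 1.00 | 3 | 34 | 0 | 141 | 0 |
| Model | 0 | 1.00 | 1 | 47 | 0 | 4762 | 0 |
| Fuel Type 1 | 0 | 1.00 | 6 | 17 | 0 | 6 | 0 |
| Fuel Type 2 | 44059 | 0.04 | 3 | 11 | 0 | 4 | 0 |
| Drive | 1186 | 0.97 | 13 | 26 | 0 | 7 | 0 |
| Engine Description | 17031 | 0.63 | 1 | 46 | 0 | 589 | 0 |
| Transmission | 11 | 1.00 | 12 | 32 | 0 | 40 | 0 |
| Vehicle Class | 0 | 1.00 | 4 | 34 | 0 | 34 | 0 |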

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1.00 | 23102.11 | 13403.10 | 1.00 | 11474.75 | 23090.50 | 34751.25 | 46332.00 | ▇▇▇▇▇ |
| Model Year | 0 | 1.00 | 2003.61 | 12.19 | 1984.00 | 1992.00 | 2005.00 | 2015.00 | 2023.00 | ▇▆▆▇▇ |
| Estimated Annual Petrolum Consumption (Barrels) | 0 | 1.00 | 15.33 | 4.34 | 0.05 | 12.94 | 14.88 | 17.50 | 42.50 | ▁▇▃▁▁ |
| City MPG (Fuel Type 1) | 0 | 1.00 | 19.11 | 10.31 | 6.00 | 15.00 | 17.00 | 21.00 | 150.00 | ▇▁▁▁▁ |
| Highway MPG (Fuel Type 1) | 0 | 1.00 | 25.16 | 9.40 | 9.00 | 20.00 | 24.00 | 28.00 | 140.00 | ▇▁▁▁▁ |
| Combined MPG (Fuel Type 1) | 0 | 1.00 | 21.33 | 9.78 | 7.00 | 17.00 | 20.00 | 23.00 | 142.00 | ▇▁▁▁▁ |
| City MPG (Fuel Type 2) | 0 | 1.00 | 0.85 | 6.47 | 0.00 | 0.00 | 0.00 | 0.00 | 145.00 | ▇▁▁▁▁ |
| Highway MPG (Fuel Type 2) | 0 | 1.00 | 1.00 | 6.55 | 0.00 | 0.00 | 0.00 | 0.00 | 121.00 | ▇▁▁▁▁ |
| Combined MPG (Fuel Type 2) | 0 | 1.00 | 0.90 | 6.43 | 0.00 | 0.00 | 0.00 | 0.00 | 133.00 | ▇▁▁▁▁ |
| Engine Cylinders | 487 | 0.99 | 5.71 | 1.77 | 2.00 | 4.00 | 6.00 | 6.00 | 16.00 | ▇▇▅▁▁ |
| Engine Displacement | 485 | 0.99 | 3.28 | 1.36 | 0.00 | 2.20 | 3.00 | 4.20 | 8.40 | ▁▇▅▂▁ |
| Time to Charge EV (hours at 120v) | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| Time to Charge EV (hours at 240v) | 0 | 1.00 | 0.11 | 1.01 | 0.00 | 0.00 | 0.00 | 0.00 | 15.30 | ▇▁▁▁▁ |
| Range (for EV) | 0 | 1.00 | 2.36 | 24.97 | 0.00 | 0.00 | 0.00 | 0.00 | 520.00 | ▇▁▁▁▁ |
| City Range (for EV - Fuel Type 1) | 0 | 1.00 | 1.62 | 20.89 | 0.00 | 0.00 | 0.00 | 0.00 | 520.80 | ▇▁▁▁▁ |
| City Range (for EV - Fuel Type 2) | 0 | 1.00 | 0.17 | 2.73 | 0.00 | 0.00 | 0.00 | 0.00 | 135.28 | ▇▁▁▁▁ |
| Hwy Range (for EV - Fuel Type 1) | 0 | 1.00 | 1.51 | 19.70 | 0.00 | 0.00 | 0.00 | 0.00 | 520.50 | ▇▁▁▁▁ |
| Hwy Range (for EV - Fuel Type 2) | 0 | 1.00 | 0.16 | 2.46 | 0.00 | 0.00 | 0.00 | 0.00 | 114.76 | ▇▁▁▁▁ |

The dataset that we are working with contains approx. 46’000 rows and 26 columns. We can see that most of our features concern the fuel consumption of the cars. In addition, we notice that some variables contain many missing values and that the variable “Time.to.Charge.EV..hours.at.120v.” contains only 0s. We will handle these in the “data cleaning” section.
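The two observations above (columns with many missing values, and a column that contains only 0s) can also be checked programmatically rather than read off the summary. The snippet below is a minimal base R sketch on a small made-up data frame; the values are hypothetical and only the column names mirror the real data.

```r
# Toy stand-in for the vehicle data (hypothetical values, not the real dataset).
toy <- data.frame(
  Make = c("Audi", "BMW", "Tesla", "Toyota", "Ford"),
  Fuel.Type.2 = c(NA, NA, NA, "Electricity", "E85"),
  Engine.Cylinders = c(4, 6, NA, 4, 8),
  Time.to.Charge.EV..hours.at.120v. = c(0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Share of missing values per column, sorted from most to least incomplete.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(missing_share, 2))

# Columns with at most one distinct non-missing value carry no information.
is_constant <- vapply(
  toy,
  function(x) length(unique(x[!is.na(x)])) <= 1,
  logical(1)
)
constant_cols <- names(toy)[is_constant]
print(constant_cols)
```

Applied to the real data, the same two checks would list the candidates for imputation or removal before the cleaning step.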

Exploration of the distribution

Here are more details about the distribution of the numerical features.

::: {.cell}

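# Density plots of all numeric features (alternative view, currently disabled)
# melt.data <- melt(data)
#
# ggplot(data = melt.data, aes(x = value)) +
#   stat_density() +
#   facet_wrap(~variable, scales = "free")

# Histograms of all numeric features (plot_histogram() comes from DataExplorer).
# Time.to.Charge.EV..hours.at.120v. does not appear because all observations are 0.
plot_histogram(data)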

::: ::: {.cell hash=‘report_html_cache/html/unnamed-chunk-6_6fdfa5f4fe30c3a8f031ecf732357900’}

#tentative boxplots

# data_long <- data %>%
#   select_if(is.numeric) %>%
#   pivot_longer(cols = c("ID",                                             
#   "Model.Year",                                     
#  "Estimated.Annual.Petrolum.Consumption..Barrels.",
#  "City.MPG..Fuel.Type.1.",                         
#  "Highway.MPG..Fuel.Type.1." ,                     
#  "Combined.MPG..Fuel.Type.1.",                     
#  "City.MPG..Fuel.Type.2." ,                        
#  "Highway.MPG..Fuel.Type.2.",                      
#  "Combined.MPG..Fuel.Type.2."   ,                  
# "Time.to.Charge.EV..hours.at.120v." ,             
#  "Time.to.Charge.EV..hours.at.240v." ,             
# "Range..for.EV.",                                 
# "City.Range..for.EV...Fuel.Type.1.",              
# "City.Range..for.EV...Fuel.Type.2.",              
# "Hwy.Range..for.EV...Fuel.Type.1." ,              
# "Hwy.Range..for.EV...Fuel.Type.2." ), names_to = "variable", values_to = "value")
# 
# ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
#   geom_boxplot() +
#   facet_wrap(~ variable, scales = "free_y") +  # Each variable gets its own y-axis
#   theme_minimal() +
#   labs(title = "Boxplots of Variables with Different Scales", x = "", y = "Value")

:::

#Now 
# plot_correlation(data) #drop time charge EV 120V
# create_report(data)
#nb cars per brand

Number of models per make

::: {.cell}

# Number of distinct models per make
nb_model_per_make <- data %>%
  group_by(Make, Model) %>%
  summarise(Number = n(), .groups = 'drop') %>%
  group_by(Make) %>%
  summarise(Models_Per_Make = n(), .groups = 'drop') %>%
  arrange(desc(Models_Per_Make))

#table
datatable(nb_model_per_make,
          rownames = FALSE,
          options = list(pageLength = 10,
                         class = "hover",
                         searchHighlight = TRUE))
# Reorder Make so that the bars are sorted by Models_Per_Make (descending)
nb_model_per_make$Make <- factor(nb_model_per_make$Make, levels = nb_model_per_make$Make[order(-nb_model_per_make$Models_Per_Make)])

# Bar plot: number of models on the x-axis, make on the y-axis
ggplot(nb_model_per_make, aes(x = Models_Per_Make, y = Make, fill = Make)) +
  geom_bar(stat = "identity", color = "black", show.legend = FALSE) +
  labs(title = "Bar Plot of Models per Make",
       x = "Number of Models",
       y = "Make") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis tick labels

:::
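As a cross-check of the counting logic used for the table and bar plot (each Make/Model pair is counted once, however many rows it has), the same computation can be sketched in base R. The rows below are made up for illustration and are not taken from the dataset.

```r
# Toy stand-in (hypothetical rows): one row per vehicle configuration,
# so the same Make/Model pair can appear several times.
toy <- data.frame(
  Make  = c("Audi", "Audi", "Audi", "Audi", "BMW", "BMW", "Fiat"),
  Model = c("A3",   "A3",   "A4",   "A6",   "X1",  "X3",  "500"),
  stringsAsFactors = FALSE
)

# Keep each Make/Model pair once, then count the pairs per make.
pairs <- unique(toy[, c("Make", "Model")])
models_per_make <- sort(table(pairs$Make), decreasing = TRUE)
print(models_per_make)
```

Here the duplicated Audi A3 rows contribute a single model, which is the behaviour the two-step `group_by()` / `summarise()` pipeline relies on.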

Correlation matrix for numerical features

::: {.cell}

library(corrplot)
library(reshape2)

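# Select only the numeric columns. Time.to.Charge.EV..hours.at.120v. is not dropped here:
# it is constant (all 0), so its correlations are NA and cor() warns about a zero standard deviation.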
data_corrplot <- data %>%
  select_if(is.numeric)


#correlation transformation for plot
cor_matrix <- cor(data_corrplot, use = "complete.obs")
Warning in cor(data_corrplot, use = "complete.obs"): the standard deviation is
zero
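The warning above, and the NA entries in the printed matrix, come from columns that are constant once `use = "complete.obs"` has removed every row with a missing value. According to the skim summary, the only numeric columns with missing values are Engine Cylinders and Engine Displacement. One plausible reading (an inference from the output, not something the data confirms here) is that these rows are electric vehicles, so that after listwise deletion the EV range columns contain only 0s. The sketch below reproduces the effect on made-up values and shows one possible alternative.

```r
# Toy stand-in (hypothetical values): the two electric cars have no cylinder
# count, and they are the only rows with a non-zero EV range.
toy <- data.frame(
  Combined.MPG     = c(22, 30, 18, 25, 110, 120),
  Engine.Cylinders = c(6, 4, 8, 4, NA, NA),
  Range.EV         = c(0, 0, 0, 0, 250, 310),
  Charge.120v      = c(0, 0, 0, 0, 0, 0)
)

# Listwise deletion drops both electric rows, so Range.EV becomes constant
# and its correlations are NA (R warns that the standard deviation is zero).
cor_listwise <- suppressWarnings(cor(toy, use = "complete.obs"))

# One possible alternative: drop columns that are constant on the full data,
# then let each pair of columns use all rows where both values are present.
# Pairs that never vary together (here Engine.Cylinders and Range.EV) stay NA.
varying <- vapply(toy, function(x) sd(x, na.rm = TRUE) > 0, logical(1))
cor_pairwise <- suppressWarnings(
  cor(toy[, varying], use = "pairwise.complete.obs")
)

print(round(cor_listwise, 2))
print(round(cor_pairwise, 2))
```

With pairwise deletion each coefficient is computed on a different subset of rows, so the resulting matrix should be read pair by pair rather than as a single consistent correlation structure.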
print(cor_matrix)
                                                          ID   Model Year
ID                                               1.000000000  0.898172038
Model Year                                       0.898172038  1.000000000
Estimated Annual Petrolum Consumption (Barrels) -0.266893652 -0.276815795
City MPG (Fuel Type 1)                           0.241377134  0.228012823
Highway MPG (Fuel Type 1)                        0.283084314  0.291126991
Combined MPG (Fuel Type 1)                       0.261835243  0.256301039
City MPG (Fuel Type 2)                           0.134630658  0.134178928
Highway MPG (Fuel Type 2)                        0.146379198  0.147794175
Combined MPG (Fuel Type 2)                       0.140054478  0.140327837
Engine Cylinders                                 0.033692800  0.050183765
Engine Displacement                             -0.003199825  0.003489418
Time to Charge EV (hours at 120v)                         NA           NA
Time to Charge EV (hours at 240v)                0.105358751  0.099465867
Range (for EV)                                            NA           NA
City Range (for EV - Fuel Type 1)                         NA           NA
City Range (for EV - Fuel Type 2)                0.087660644  0.082763271
Hwy Range (for EV - Fuel Type 1)                          NA           NA
Hwy Range (for EV - Fuel Type 2)                 0.091333071  0.086200984
                                                Estimated Annual Petrolum Consumption (Barrels)
ID                                                                                   -0.2668937
Model Year                                                                           -0.2768158
Estimated Annual Petrolum Consumption (Barrels)                                       1.0000000
City MPG (Fuel Type 1)                                                               -0.8653379
Highway MPG (Fuel Type 1)                                                            -0.9035610
Combined MPG (Fuel Type 1)                                                           -0.9001868
City MPG (Fuel Type 2)                                                               -0.1689445
Highway MPG (Fuel Type 2)                                                            -0.1539511
Combined MPG (Fuel Type 2)                                                           -0.1637478
Engine Cylinders                                                                      0.7331933
Engine Displacement                                                                   0.7837093
Time to Charge EV (hours at 120v)                                                            NA
Time to Charge EV (hours at 240v)                                                    -0.1859223
Range (for EV)                                                                               NA
City Range (for EV - Fuel Type 1)                                                            NA
City Range (for EV - Fuel Type 2)                                                    -0.1727049
Hwy Range (for EV - Fuel Type 1)                                                             NA
Hwy Range (for EV - Fuel Type 2)                                                     -0.1766306
                                                City MPG (Fuel Type 1)
ID                                                           0.2413771
Model Year                                                   0.2280128
Estimated Annual Petrolum Consumption (Barrels)             -0.8653379
City MPG (Fuel Type 1)                                       1.0000000
Highway MPG (Fuel Type 1)                                    0.9207665
Combined MPG (Fuel Type 1)                                   0.9857637
City MPG (Fuel Type 2)                                       0.1671384
Highway MPG (Fuel Type 2)                                    0.1420879
Combined MPG (Fuel Type 2)                                   0.1574358
Engine Cylinders                                            -0.6771928
Engine Displacement                                         -0.7115445
Time to Charge EV (hours at 120v)                                   NA
Time to Charge EV (hours at 240v)                            0.1414919
Range (for EV)                                                      NA
City Range (for EV - Fuel Type 1)                                   NA
City Range (for EV - Fuel Type 2)                            0.1530046
Hwy Range (for EV - Fuel Type 1)                                    NA
Hwy Range (for EV - Fuel Type 2)                             0.1510883
                                                Highway MPG (Fuel Type 1)
ID                                                             0.28308431
Model Year                                                     0.29112699
Estimated Annual Petrolum Consumption (Barrels)               -0.90356096
City MPG (Fuel Type 1)                                         0.92076647
Highway MPG (Fuel Type 1)                                      1.00000000
Combined MPG (Fuel Type 1)                                     0.96771702
City MPG (Fuel Type 2)                                         0.08903949
Highway MPG (Fuel Type 2)                                      0.07514936
Combined MPG (Fuel Type 2)                                     0.08359213
Engine Cylinders                                              -0.64689904
Engine Displacement                                           -0.70631422
Time to Charge EV (hours at 120v)                                      NA
Time to Charge EV (hours at 240v)                              0.07127195
Range (for EV)                                                         NA
City Range (for EV - Fuel Type 1)                                      NA
City Range (for EV - Fuel Type 2)                              0.08014444
Hwy Range (for EV - Fuel Type 1)                                       NA
Hwy Range (for EV - Fuel Type 2)                               0.07924589
                                                Combined MPG (Fuel Type 1)
ID                                                               0.2618352
Model Year                                                       0.2563010
Estimated Annual Petrolum Consumption (Barrels)                 -0.9001868
City MPG (Fuel Type 1)                                           0.9857637
Highway MPG (Fuel Type 1)                                        0.9677170
Combined MPG (Fuel Type 1)                                       1.0000000
City MPG (Fuel Type 2)                                           0.1365411
Highway MPG (Fuel Type 2)                                        0.1157063
Combined MPG (Fuel Type 2)                                       0.1284624
Engine Cylinders                                                -0.6825224
Engine Displacement                                             -0.7267671
Time to Charge EV (hours at 120v)                                       NA
Time to Charge EV (hours at 240v)                                0.1153313
Range (for EV)                                                          NA
City Range (for EV - Fuel Type 1)                                       NA
City Range (for EV - Fuel Type 2)                                0.1256720
Hwy Range (for EV - Fuel Type 1)                                        NA
Hwy Range (for EV - Fuel Type 2)                                 0.1242627
                                                City MPG (Fuel Type 2)
ID                                                          0.13463066
Model Year                                                  0.13417893
Estimated Annual Petrolum Consumption (Barrels)            -0.16894449
City MPG (Fuel Type 1)                                      0.16713838
Highway MPG (Fuel Type 1)                                   0.08903949
Combined MPG (Fuel Type 1)                                  0.13654114
City MPG (Fuel Type 2)                                      1.00000000
Highway MPG (Fuel Type 2)                                   0.98322734
Combined MPG (Fuel Type 2)                                  0.99700069
Engine Cylinders                                           -0.02312181
Engine Displacement                                        -0.02485931
Time to Charge EV (hours at 120v)                                   NA
Time to Charge EV (hours at 240v)                           0.83003369
Range (for EV)                                                      NA
City Range (for EV - Fuel Type 1)                                   NA
City Range (for EV - Fuel Type 2)                           0.79927886
Hwy Range (for EV - Fuel Type 1)                                    NA
Hwy Range (for EV - Fuel Type 2)                            0.81001967
                                                Highway MPG (Fuel Type 2)
ID                                                            0.146379198
Model Year                                                    0.147794175
Estimated Annual Petrolum Consumption (Barrels)              -0.153951086
City MPG (Fuel Type 1)                                        0.142087885
Highway MPG (Fuel Type 1)                                     0.075149357
Combined MPG (Fuel Type 1)                                    0.115706313
City MPG (Fuel Type 2)                                        0.983227343
Highway MPG (Fuel Type 2)                                     1.000000000
Combined MPG (Fuel Type 2)                                    0.994192508
Engine Cylinders                                             -0.005619978
Engine Displacement                                          -0.005966497
Time to Charge EV (hours at 120v)                                      NA
Time to Charge EV (hours at 240v)                             0.799168034
Range (for EV)                                                         NA
City Range (for EV - Fuel Type 1)                                      NA
City Range (for EV - Fuel Type 2)                             0.739164873
Hwy Range (for EV - Fuel Type 1)                                       NA
Hwy Range (for EV - Fuel Type 2)                              0.755992195
                                                Combined MPG (Fuel Type 2)
ID                                                              0.14005448
Model Year                                                      0.14032784
Estimated Annual Petrolum Consumption (Barrels)                -0.16374776
City MPG (Fuel Type 1)                                          0.15743582
Highway MPG (Fuel Type 1)                                       0.08359213
Combined MPG (Fuel Type 1)                                      0.12846237
City MPG (Fuel Type 2)                                          0.99700069
Highway MPG (Fuel Type 2)                                       0.99419251
Combined MPG (Fuel Type 2)                                      1.00000000
Engine Cylinders                                               -0.01618583
Engine Displacement                                            -0.01738760
Time to Charge EV (hours at 120v)                                       NA
Time to Charge EV (hours at 240v)                               0.82239662
Range (for EV)                                                          NA
City Range (for EV - Fuel Type 1)                                       NA
City Range (for EV - Fuel Type 2)                               0.77761086
Hwy Range (for EV - Fuel Type 1)                                        NA
Hwy Range (for EV - Fuel Type 2)                                0.79135465
                                                Engine Cylinders
ID                                                   0.033692800
Model Year                                           0.050183765
Estimated Annual Petrolum Consumption (Barrels)      0.733193261
City MPG (Fuel Type 1)                              -0.677192760
Highway MPG (Fuel Type 1)                           -0.646899039
Combined MPG (Fuel Type 1)                          -0.682522397
City MPG (Fuel Type 2)                              -0.023121807
Highway MPG (Fuel Type 2)                           -0.005619978
Combined MPG (Fuel Type 2)                          -0.016185826
Engine Cylinders                                     1.000000000
Engine Displacement                                  0.905190858
Time to Charge EV (hours at 120v)                             NA
Time to Charge EV (hours at 240v)                   -0.049696335
Range (for EV)                                                NA
City Range (for EV - Fuel Type 1)                             NA
City Range (for EV - Fuel Type 2)                   -0.057700272
Hwy Range (for EV - Fuel Type 1)                              NA
Hwy Range (for EV - Fuel Type 2)                    -0.057492631
                                                Engine Displacement
ID                                                     -0.003199825
Model Year                                              0.003489418
Estimated Annual Petrolum Consumption (Barrels)         0.783709304
City MPG (Fuel Type 1)                                 -0.711544513
Highway MPG (Fuel Type 1)                              -0.706314224
Combined MPG (Fuel Type 1)                             -0.726767140
City MPG (Fuel Type 2)                                 -0.024859311
Highway MPG (Fuel Type 2)                              -0.005966497
Combined MPG (Fuel Type 2)                             -0.017387603
Engine Cylinders                                        0.905190858
Engine Displacement                                     1.000000000
Time to Charge EV (hours at 120v)                                NA
Time to Charge EV (hours at 240v)                      -0.060216571
Range (for EV)                                                   NA
City Range (for EV - Fuel Type 1)                                NA
City Range (for EV - Fuel Type 2)                      -0.062930177
Hwy Range (for EV - Fuel Type 1)                                 NA
Hwy Range (for EV - Fuel Type 2)                       -0.063488571
                                                Time to Charge EV (hours at 120v)
ID                                                                             NA
Model Year                                                                     NA
Estimated Annual Petrolum Consumption (Barrels)                                NA
City MPG (Fuel Type 1)                                                         NA
Highway MPG (Fuel Type 1)                                                      NA
Combined MPG (Fuel Type 1)                                                     NA
City MPG (Fuel Type 2)                                                         NA
Highway MPG (Fuel Type 2)                                                      NA
Combined MPG (Fuel Type 2)                                                     NA
Engine Cylinders                                                               NA
Engine Displacement                                                            NA
Time to Charge EV (hours at 120v)                                               1
Time to Charge EV (hours at 240v)                                              NA
Range (for EV)                                                                 NA
City Range (for EV - Fuel Type 1)                                              NA
City Range (for EV - Fuel Type 2)                                              NA
Hwy Range (for EV - Fuel Type 1)                                               NA
Hwy Range (for EV - Fuel Type 2)                                               NA
                                                Time to Charge EV (hours at 240v)
ID                                                                     0.10535875
Model Year                                                             0.09946587
Estimated Annual Petrolum Consumption (Barrels)                       -0.18592232
City MPG (Fuel Type 1)                                                 0.14149189
Highway MPG (Fuel Type 1)                                              0.07127195
Combined MPG (Fuel Type 1)                                             0.11533134
City MPG (Fuel Type 2)                                                 0.83003369
Highway MPG (Fuel Type 2)                                              0.79916803
Combined MPG (Fuel Type 2)                                             0.82239662
Engine Cylinders                                                      -0.04969633
Engine Displacement                                                   -0.06021657
Time to Charge EV (hours at 120v)                                              NA
Time to Charge EV (hours at 240v)                                      1.00000000
Range (for EV)                                                                 NA
City Range (for EV - Fuel Type 1)                                              NA
City Range (for EV - Fuel Type 2)                                      0.88124788
Hwy Range (for EV - Fuel Type 1)                                               NA
Hwy Range (for EV - Fuel Type 2)                                       0.90893793
                                                Range (for EV)
ID                                                          NA
Model Year                                                  NA
Estimated Annual Petrolum Consumption (Barrels)             NA
City MPG (Fuel Type 1)                                      NA
Highway MPG (Fuel Type 1)                                   NA
Combined MPG (Fuel Type 1)                                  NA
City MPG (Fuel Type 2)                                      NA
Highway MPG (Fuel Type 2)                                   NA
Combined MPG (Fuel Type 2)                                  NA
Engine Cylinders                                            NA
Engine Displacement                                         NA
Time to Charge EV (hours at 120v)                           NA
Time to Charge EV (hours at 240v)                           NA
Range (for EV)                                               1
City Range (for EV - Fuel Type 1)                           NA
City Range (for EV - Fuel Type 2)                           NA
Hwy Range (for EV - Fuel Type 1)                            NA
Hwy Range (for EV - Fuel Type 2)                            NA
                                                City Range (for EV - Fuel Type 1)
ID                                                                             NA
Model Year                                                                     NA
Estimated Annual Petrolum Consumption (Barrels)                                NA
City MPG (Fuel Type 1)                                                         NA
Highway MPG (Fuel Type 1)                                                      NA
Combined MPG (Fuel Type 1)                                                     NA
City MPG (Fuel Type 2)                                                         NA
Highway MPG (Fuel Type 2)                                                      NA
Combined MPG (Fuel Type 2)                                                     NA
Engine Cylinders                                                               NA
Engine Displacement                                                            NA
Time to Charge EV (hours at 120v)                                              NA
Time to Charge EV (hours at 240v)                                              NA
Range (for EV)                                                                 NA
City Range (for EV - Fuel Type 1)                                               1
City Range (for EV - Fuel Type 2)                                              NA
Hwy Range (for EV - Fuel Type 1)                                               NA
Hwy Range (for EV - Fuel Type 2)                                               NA
                                                City Range (for EV - Fuel Type 2)
ID                                                                     0.08766064
Model Year                                                             0.08276327
Estimated Annual Petrolum Consumption (Barrels)                       -0.17270494
City MPG (Fuel Type 1)                                                 0.15300461
Highway MPG (Fuel Type 1)                                              0.08014444
Combined MPG (Fuel Type 1)                                             0.12567199
City MPG (Fuel Type 2)                                                 0.79927886
Highway MPG (Fuel Type 2)                                              0.73916487
Combined MPG (Fuel Type 2)                                             0.77761086
Engine Cylinders                                                      -0.05770027
Engine Displacement                                                   -0.06293018
Time to Charge EV (hours at 120v)                                              NA
Time to Charge EV (hours at 240v)                                      0.88124788
Range (for EV)                                                                 NA
City Range (for EV - Fuel Type 1)                                              NA
City Range (for EV - Fuel Type 2)                                      1.00000000
Hwy Range (for EV - Fuel Type 1)                                               NA
Hwy Range (for EV - Fuel Type 2)                                       0.99601860
                                                Hwy Range (for EV - Fuel Type 1)
ID                                                                            NA
Model Year                                                                    NA
Estimated Annual Petrolum Consumption (Barrels)                               NA
City MPG (Fuel Type 1)                                                        NA
Highway MPG (Fuel Type 1)                                                     NA
Combined MPG (Fuel Type 1)                                                    NA
City MPG (Fuel Type 2)                                                        NA
Highway MPG (Fuel Type 2)                                                     NA
Combined MPG (Fuel Type 2)                                                    NA
Engine Cylinders                                                              NA
Engine Displacement                                                           NA
Time to Charge EV (hours at 120v)                                             NA
Time to Charge EV (hours at 240v)                                             NA
Range (for EV)                                                                NA
City Range (for EV - Fuel Type 1)                                             NA
City Range (for EV - Fuel Type 2)                                             NA
Hwy Range (for EV - Fuel Type 1)                                               1
Hwy Range (for EV - Fuel Type 2)                                              NA
                                                Hwy Range (for EV - Fuel Type 2)
ID                                                                    0.09133307
Model Year                                                            0.08620098
Estimated Annual Petrolum Consumption (Barrels)                      -0.17663055
City MPG (Fuel Type 1)                                                0.15108834
Highway MPG (Fuel Type 1)                                             0.07924589
Combined MPG (Fuel Type 1)                                            0.12426275
City MPG (Fuel Type 2)                                                0.81001967
Highway MPG (Fuel Type 2)                                             0.75599220
Combined MPG (Fuel Type 2)                                            0.79135465
Engine Cylinders                                                     -0.05749263
Engine Displacement                                                  -0.06348857
Time to Charge EV (hours at 120v)                                             NA
Time to Charge EV (hours at 240v)                                     0.90893793
Range (for EV)                                                                NA
City Range (for EV - Fuel Type 1)                                             NA
City Range (for EV - Fuel Type 2)                                     0.99601860
Hwy Range (for EV - Fuel Type 1)                                              NA
Hwy Range (for EV - Fuel Type 2)                                      1.00000000
Show the code
kable(cor_matrix)
ID Model Year Estimated Annual Petrolum Consumption (Barrels) City MPG (Fuel Type 1) Highway MPG (Fuel Type 1) Combined MPG (Fuel Type 1) City MPG (Fuel Type 2) Highway MPG (Fuel Type 2) Combined MPG (Fuel Type 2) Engine Cylinders Engine Displacement Time to Charge EV (hours at 120v) Time to Charge EV (hours at 240v) Range (for EV) City Range (for EV - Fuel Type 1) City Range (for EV - Fuel Type 2) Hwy Range (for EV - Fuel Type 1) Hwy Range (for EV - Fuel Type 2)
ID 1.0000000 0.8981720 -0.2668937 0.2413771 0.2830843 0.2618352 0.1346307 0.1463792 0.1400545 0.0336928 -0.0031998 NA 0.1053588 NA NA 0.0876606 NA 0.0913331
Model Year 0.8981720 1.0000000 -0.2768158 0.2280128 0.2911270 0.2563010 0.1341789 0.1477942 0.1403278 0.0501838 0.0034894 NA 0.0994659 NA NA 0.0827633 NA 0.0862010
Estimated Annual Petrolum Consumption (Barrels) -0.2668937 -0.2768158 1.0000000 -0.8653379 -0.9035610 -0.9001868 -0.1689445 -0.1539511 -0.1637478 0.7331933 0.7837093 NA -0.1859223 NA NA -0.1727049 NA -0.1766306
City MPG (Fuel Type 1) 0.2413771 0.2280128 -0.8653379 1.0000000 0.9207665 0.9857637 0.1671384 0.1420879 0.1574358 -0.6771928 -0.7115445 NA 0.1414919 NA NA 0.1530046 NA 0.1510883
Highway MPG (Fuel Type 1) 0.2830843 0.2911270 -0.9035610 0.9207665 1.0000000 0.9677170 0.0890395 0.0751494 0.0835921 -0.6468990 -0.7063142 NA 0.0712719 NA NA 0.0801444 NA 0.0792459
Combined MPG (Fuel Type 1) 0.2618352 0.2563010 -0.9001868 0.9857637 0.9677170 1.0000000 0.1365411 0.1157063 0.1284624 -0.6825224 -0.7267671 NA 0.1153313 NA NA 0.1256720 NA 0.1242627
City MPG (Fuel Type 2) 0.1346307 0.1341789 -0.1689445 0.1671384 0.0890395 0.1365411 1.0000000 0.9832273 0.9970007 -0.0231218 -0.0248593 NA 0.8300337 NA NA 0.7992789 NA 0.8100197
Highway MPG (Fuel Type 2) 0.1463792 0.1477942 -0.1539511 0.1420879 0.0751494 0.1157063 0.9832273 1.0000000 0.9941925 -0.0056200 -0.0059665 NA 0.7991680 NA NA 0.7391649 NA 0.7559922
Combined MPG (Fuel Type 2) 0.1400545 0.1403278 -0.1637478 0.1574358 0.0835921 0.1284624 0.9970007 0.9941925 1.0000000 -0.0161858 -0.0173876 NA 0.8223966 NA NA 0.7776109 NA 0.7913547
Engine Cylinders 0.0336928 0.0501838 0.7331933 -0.6771928 -0.6468990 -0.6825224 -0.0231218 -0.0056200 -0.0161858 1.0000000 0.9051909 NA -0.0496963 NA NA -0.0577003 NA -0.0574926
Engine Displacement -0.0031998 0.0034894 0.7837093 -0.7115445 -0.7063142 -0.7267671 -0.0248593 -0.0059665 -0.0173876 0.9051909 1.0000000 NA -0.0602166 NA NA -0.0629302 NA -0.0634886
Time to Charge EV (hours at 120v) NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA
Time to Charge EV (hours at 240v) 0.1053588 0.0994659 -0.1859223 0.1414919 0.0712719 0.1153313 0.8300337 0.7991680 0.8223966 -0.0496963 -0.0602166 NA 1.0000000 NA NA 0.8812479 NA 0.9089379
Range (for EV) NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA
City Range (for EV - Fuel Type 1) NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA
City Range (for EV - Fuel Type 2) 0.0876606 0.0827633 -0.1727049 0.1530046 0.0801444 0.1256720 0.7992789 0.7391649 0.7776109 -0.0577003 -0.0629302 NA 0.8812479 NA NA 1.0000000 NA 0.9960186
Hwy Range (for EV - Fuel Type 1) NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
Hwy Range (for EV - Fuel Type 2) 0.0913331 0.0862010 -0.1766306 0.1510883 0.0792459 0.1242627 0.8100197 0.7559922 0.7913547 -0.0574926 -0.0634886 NA 0.9089379 NA NA 0.9960186 NA 1.0000000
Show the code
cor_melted <- melt(cor_matrix)


#plot
ggplot(data = cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1),
        axis.text.y = element_text(size = 8)) +
  coord_fixed() +
  labs(x = '', y = '', title = 'Correlation Matrix Heatmap')


3 Data cleaning

In this section we handle the missing values in our dataset so that we have a clean dataset for our EDA and modeling. We first visualize the missing values, then clean them in the columns that we use for our analysis. We also remove some rows and columns that are not relevant for our analysis.

Let’s have a look at the entire dataset and its missing values in grey.

We can see that overall, we do not have many missing values in proportion to the size of our dataset. However, some columns have a lot of missing values. Let’s look at the columns and rows with missing values in more detail.

We can now see the missing values in our data more easily. Below is the percentage of missing values by column.

Let’s first have a closer look at the engine cylinders and engine displacement columns.

We see that, for all {r} miss_elec rows with missing values in “Engine Cylinders” and “Engine Displacement”, the vehicle fuel type is “{r} fuel_type_1_miss”. We can therefore conclude that the missing values in these two columns correspond to our electric vehicles. This makes sense, since electric vehicles do not have a combustion engine and these features do not apply to them. We therefore replace all missing values in these two columns with “none”.
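The check described above can be sketched in Python with pandas. This is a minimal illustration on a toy data frame, not the project's own R code: the column names follow the cleaned CSV shown later in this report, and all values are invented.

```python
import pandas as pd

# Toy data (invented values): the electric vehicles have no engine information.
cars = pd.DataFrame({
    "fuel_type_1": ["Regular Gasoline", "Electricity", "Diesel", "Electricity"],
    "engine_cylinders": [4.0, None, 6.0, None],
    "engine_displacement": [2.0, None, 3.0, None],
})
engine_cols = ["engine_cylinders", "engine_displacement"]

# Keep the rows where at least one engine column is missing.
missing_engine = cars[cars[engine_cols].isna().any(axis=1)]

# Which fuel types do these rows have?
fuel_types_missing = missing_engine["fuel_type_1"].unique().tolist()
print(fuel_types_missing)

# Replace the missing values with the label "none".
# The columns become text (categorical) columns as a result.
cars_filled = cars.copy()
for col in engine_cols:
    cars_filled[col] = cars[col].map(lambda v: "none" if pd.isna(v) else str(v))
print(cars_filled)
```

If the only fuel type listed is the electric one, treating the missing engine values as a separate "none" category rather than dropping the rows keeps the electric vehicles in the dataset.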

Show the code
# Create a summary dataframe of missing values by column
missing_summary_df2 <- data_cleaning %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
  mutate(
    Total_Rows = nrow(data_cleaning),
    Proportion_Missing = Missing_Count / Total_Rows
  ) %>%
  arrange(desc(Proportion_Missing)) %>%
  select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)

# Print the summary dataframe
datatable(missing_summary_df2,
          options = list(pageLength = 6,
                          class = "hover",
                          searchHighlight = TRUE),
          rownames = FALSE)%>%
  formatPercentage("Prop. Missing", 2)
Show the code
# Count the missing 'Drive' values per brand
missing_drive_by_make <- data_cleaning %>% 
  filter(is.na(Drive)) %>% 
  count(Make)

# Get total counts per brand in the entire dataset
total_counts_by_make <- data_cleaning %>% 
  count(Make)

# Calculate the percentage of missing 'Drive' values per brand
percentage_missing_drive_by_make <- missing_drive_by_make %>%
  left_join(total_counts_by_make, by = "Make", suffix = c(".missing", ".total")) %>%
  mutate(PercentageMissing = (n.missing / n.total)) %>%
  arrange(desc(PercentageMissing))

# Print the summary dataframe
datatable(percentage_missing_drive_by_make,
          options = list(pageLength = 6,
                          class = "hover",
                          searchHighlight = TRUE),
          rownames = FALSE)%>%
  formatPercentage("PercentageMissing", 2)

Show the code
# Calculate the percentage of missing 'Drive' values per brand
brand_summary <- data_cleaning %>%
  group_by(Make) %>%
  summarise(Total = n(),
            Missing = sum(is.na(Drive)),
            PercentageMissing = (Missing / Total))

# Identify brands with more than 10% missing 'Drive' data
brands_to_remove <- brand_summary %>%
  filter(PercentageMissing > brand_missing_threshold) %>%
  pull(Make)

# Filter out these brands from the dataset
data_filtered <- data_cleaning %>%
  filter(!(Make %in% brands_to_remove))

# For the remaining data, drop rows with missing 'Drive' values
data_cleaning2 <- data_filtered %>%
  filter(!is.na(Drive))
Show the code
# Create a summary dataframe of missing values by column
missing_summary_df3 <- data_cleaning2 %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
  mutate(
    Total_Rows = nrow(data_cleaning2),
    Proportion_Missing = Missing_Count / Total_Rows
  ) %>%
  arrange(desc(Proportion_Missing)) %>%
  select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)

# Print the summary dataframe
datatable(missing_summary_df3,
          options = list(pageLength = 6,
                          class = "hover",
                          searchHighlight = TRUE),
          rownames = FALSE)%>%
  formatPercentage("Prop. Missing", 2)
Show the code
# Remove rows where the 'Transmission' column has missing values
data_cleaning3 <- data_cleaning2 %>%
  filter(!is.na(Transmission))
# Replace missing values in the 'Fuel.Type.2' column with the label "none"
data_cleaning4 <- data_cleaning3 %>%
  mutate(Fuel.Type.2 = replace_na(Fuel.Type.2, "none"))
Show the code
# Create a summary dataframe of missing values by column
missing_summary_df3 <- data_cleaning3 %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
  mutate(
    Total_Rows = nrow(data_cleaning3),
    Proportion_Missing = Missing_Count / Total_Rows
  ) %>%
  arrange(desc(Proportion_Missing)) %>%
  select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)

# Print the summary dataframe
datatable(missing_summary_df3,
          options = list(pageLength = 3,
                          class = "hover",
                          searchHighlight = TRUE),
          rownames = FALSE)%>%
  formatPercentage("Prop. Missing", 2)

4 Classification Tree

In this section we perform a classification tree analysis on the dataset. We first load the necessary packages and the dataset, then prepare the data by encoding the categorical variables and splitting it into training and testing sets. Finally, we prune the tree with different max_depth values to find the tree depth that best balances training and test accuracy.

We first loaded the dataset and identified make as the target variable. We then encoded the categorical variables with label encoding to convert them into numerical values.

We then split the dataset into training (80%) and testing (20%) sets so that we could evaluate the model’s performance on unseen data after training and check whether the model overfits. We will see that it does.
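A minimal Python sketch of these two preparation steps, assuming scikit-learn was used (the report does not show this code). The toy data frame, the `random_state` value and the choice of `LabelEncoder` for every text column are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data (invented values); 'make' is the target, as in the report.
cars = pd.DataFrame({
    "make": ["Audi", "BMW", "Audi", "Fiat", "BMW", "Fiat", "Audi", "BMW", "Fiat", "Audi"],
    "drive": ["AWD", "RWD", "FWD", "FWD", "RWD", "FWD", "AWD", "AWD", "FWD", "FWD"],
    "city_mpg": [20, 18, 24, 30, 17, 31, 21, 19, 29, 23],
})

# Label-encode each text column: every distinct category becomes an integer code.
encoded = cars.copy()
encoders = {}
for col in ["make", "drive"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(encoded[col])

X = encoded.drop(columns="make")
y = encoded["make"]

# Hold out 20% of the rows as a test set that is not used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
print(X_train.shape, X_test.shape)
```

Label encoding assigns integer codes in alphabetical order, so the codes carry an arbitrary ordering. Tree-based models can usually cope with this, whereas the neural network section of this report uses one-hot encoding instead.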

We trained a Decision Tree classifier on the training data without any constraints. The “None” row below corresponds to this unpruned tree. We observed overfitting, with a high accuracy on the training data (89.39%) and a markedly lower accuracy on the test data (69.84%). We therefore decided to prune the tree, since pruning simplifies the model and thereby limits overfitting. We did so by trying several values of the max_depth parameter, which controls the tree’s growth (5, 10, 15, 20, 25, 30 and None), looking for the depth that best balances training and test accuracy.

max_depth   Training Accuracy   Test Accuracy
5           0.2605              0.2550
10          0.4887              0.4677
15          0.7205              0.6349
20          0.8519              0.6938
25          0.8899              0.7000
30          0.8939              0.6992
None        0.8939              0.6984

The model’s accuracy improved as the tree’s depth increased up to a point, with a max_depth of 25 or 30 giving the best test accuracy, about 70%. Reducing max_depth to 10 or 15 narrows the gap between training and test accuracy and thus drastically reduces overfitting, but at the expense of accuracy on new data. Pruning the tree at a max_depth of 25 raises the test accuracy slightly, from 69.84% to 70.00%, and at the same time slightly reduces the gap between the training set and the test set. In our case, pruning the Decision Tree therefore helps its generalization performance by preventing it from becoming too complex and overfitting the training data.
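The depth comparison above can be reproduced in spirit with a short loop. The sketch below assumes scikit-learn's `DecisionTreeClassifier` and uses a synthetic dataset from `make_classification` as a stand-in for the car data, so its accuracies will not match the table; the sample sizes and `random_state` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the car data (assumption: the real features are not used here).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_classes=5, random_state=123,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

results = []
for depth in [5, 10, 15, 20, 25, 30, None]:  # None = no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    results.append((depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))

# Report training accuracy, test accuracy and the gap between them.
print(f"{'max_depth':<10}{'train':>8}{'test':>8}{'gap':>8}")
for depth, train_acc, test_acc in results:
    print(f"{str(depth):<10}{train_acc:>8.4f}{test_acc:>8.4f}{train_acc - test_acc:>8.4f}")
```

Strictly speaking, limiting `max_depth` stops the tree from growing rather than pruning a fully grown tree. scikit-learn also offers cost-complexity pruning through the `ccp_alpha` parameter, which could be compared in the same way.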

5 Neural Network

Show the code
import pandas as pd
import numpy as np
from pyprojroot.here import here
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.utils import to_categorical

# Load the data
data = pd.read_csv(here("data/data_cleaned.csv"))

# Display the structure of the data
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42240 entries, 0 to 42239
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   make                          42240 non-null  object 
 1   model_year                    42240 non-null  int64  
 2   vehicle_class                 42240 non-null  object 
 3   drive                         42240 non-null  object 
 4   engine_cylinders              42237 non-null  object 
 5   engine_displacement           42238 non-null  object 
 6   transmission                  42240 non-null  object 
 7   fuel_type_1                   42240 non-null  object 
 8   city_mpg_fuel_type_1          42240 non-null  int64  
 9   highway_mpg_fuel_type_1       42240 non-null  int64  
 10  fuel_type_2                   42240 non-null  object 
 11  city_mpg_fuel_type_2          42240 non-null  int64  
 12  highway_mpg_fuel_type_2       42240 non-null  int64  
 13  range_ev_city_fuel_type_1     42240 non-null  int64  
 14  range_ev_highway_fuel_type_1  42240 non-null  float64
 15  range_ev_city_fuel_type_2     42240 non-null  int64  
 16  range_ev_highway_fuel_type_2  42240 non-null  float64
 17  charge_time_240v              42240 non-null  float64
dtypes: float64(3), int64(7), object(8)
memory usage: 5.8+ MB
None
Show the code
# Display the first few rows of the data
print(data.head())
         make  model_year  ... range_ev_highway_fuel_type_2 charge_time_240v
0  Alfa Romeo        1985  ...                          0.0              0.0
1   Chevrolet        1985  ...                          0.0              0.0
2   Chevrolet        1985  ...                          0.0              0.0
3      Nissan        1985  ...                          0.0              0.0
4      Nissan        1985  ...                          0.0              0.0

[5 rows x 18 columns]
Show the code
# Identify categorical and numerical columns
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Remove the target column 'make' from the features list
if 'make' in categorical_cols:
    categorical_cols.remove('make')
if 'make' in numerical_cols:
    numerical_cols.remove('make')

print(f"Categorical columns: {categorical_cols}")
Categorical columns: ['vehicle_class', 'drive', 'engine_cylinders', 'engine_displacement', 'transmission', 'fuel_type_1', 'fuel_type_2']
Show the code
print(f"Numerical columns: {numerical_cols}")
Numerical columns: ['model_year', 'city_mpg_fuel_type_1', 'highway_mpg_fuel_type_1', 'city_mpg_fuel_type_2', 'highway_mpg_fuel_type_2', 'range_ev_city_fuel_type_1', 'range_ev_highway_fuel_type_1', 'range_ev_city_fuel_type_2', 'range_ev_highway_fuel_type_2', 'charge_time_240v']
Show the code
# Define the preprocessing steps for numerical and categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(sparse_output=False), categorical_cols)  # Set sparse_output to False
    ])

# Split data into features and target
X = data.drop('make', axis=1)
y = data['make']

# Apply preprocessing and split data into training and testing sets
X_preprocessed = preprocessor.fit_transform(X)

# Encode the target variable
y_encoded = pd.get_dummies(y).values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y_encoded, test_size=0.2, random_state=123)

# Define the neural network model
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(y_train.shape[1], activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
Epoch 1/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 782us/step - accuracy: 0.2074 - loss: 3.1751 - val_accuracy: 0.4678 - val_loss: 1.8306
Epoch 2/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 727us/step - accuracy: 0.4579 - loss: 1.8302 - val_accuracy: 0.5328 - val_loss: 1.4940
Epoch 3/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 726us/step - accuracy: 0.5103 - loss: 1.5813 - val_accuracy: 0.5724 - val_loss: 1.3251
Epoch 4/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 698us/step - accuracy: 0.5382 - loss: 1.4315 - val_accuracy: 0.5912 - val_loss: 1.2380
Epoch 5/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 685us/step - accuracy: 0.5596 - loss: 1.3428 - val_accuracy: 0.6072 - val_loss: 1.1585
Epoch 6/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 672us/step - accuracy: 0.5761 - loss: 1.2784 - val_accuracy: 0.6158 - val_loss: 1.1046
Epoch 7/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 677us/step - accuracy: 0.5809 - loss: 1.2288 - val_accuracy: 0.6260 - val_loss: 1.0641
Epoch 8/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 688us/step - accuracy: 0.5977 - loss: 1.1747 - val_accuracy: 0.6353 - val_loss: 1.0281
Epoch 9/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 689us/step - accuracy: 0.6082 - loss: 1.1333 - val_accuracy: 0.6352 - val_loss: 0.9959
Epoch 10/10
845/845 ━━━━━━━━━━━━━━━━━━━━ 1s 685us/step - accuracy: 0.6129 - loss: 1.1004 - val_accuracy: 0.6491 - val_loss: 0.9588
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)

264/264 ━━━━━━━━━━━━━━━━━━━━ 0s 346us/step - accuracy: 0.6396 - loss: 0.9834
print(f'Test accuracy: {accuracy}')
Test accuracy: 0.6441761255264282
# Make predictions
predictions = np.argmax(model.predict(X_test), axis=1)

264/264 ━━━━━━━━━━━━━━━━━━━━ 0s 272us/step
# Print predictions
print(predictions)
[114  19  74 ...  36  74  36]
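
To illustrate what the argmax step above does, here is a minimal sketch in plain Python. The probabilities and labels are made-up toy values, not taken from the project data, and the helper `argmax` is a hypothetical stand-in for what `np.argmax(..., axis=1)` does on each row. It turns per-class probabilities into predicted class indices and then computes the share of correct predictions.

```python
# Toy softmax outputs for 4 samples and 3 classes (hypothetical values).
probabilities = [
    [0.10, 0.70, 0.20],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
    [0.50, 0.40, 0.10],
]
true_labels = [1, 0, 2, 1]  # hypothetical ground-truth class indices


def argmax(row):
    """Return the index of the largest value in a row (first one on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


# One predicted class index per sample, as np.argmax(..., axis=1) would give.
predicted_labels = [argmax(row) for row in probabilities]

# Accuracy is the fraction of samples whose predicted class matches the label.
accuracy = sum(p == t for p, t in zip(predicted_labels, true_labels)) / len(true_labels)

print(predicted_labels)
print(accuracy)
```

In this toy case the last sample is misclassified, so three of the four predictions are correct.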
# Plot the accuracy and loss
fig, axs = plt.subplots(2, 1, figsize=(10, 10))

# Plot training & validation accuracy values
axs[0].plot(history.history['accuracy'])
axs[0].plot(history.history['val_accuracy'])
axs[0].set_title('Model accuracy')
axs[0].set_ylabel('Accuracy')
axs[0].set_xlabel('Epoch')
axs[0].legend(['Train', 'Validation'], loc='upper left')

# Plot training & validation loss values
axs[1].plot(history.history['loss'])
axs[1].plot(history.history['val_loss'])
axs[1].set_title('Model loss')
axs[1].set_ylabel('Loss')
axs[1].set_xlabel('Epoch')
axs[1].legend(['Train', 'Validation'], loc='upper left')

plt.tight_layout()
plt.show()

source(here::here("scripts","setup.R"))
library(data.table)

Attaching package: 'data.table'
The following objects are masked from 'package:reshape2':

    dcast, melt
The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year
The following object is masked from 'package:purrr':

    transpose
The following objects are masked from 'package:dplyr':

    between, first, last
data_cleaned <- fread(here::here("data", "data_cleaned.csv"))

To examine how the features relate to one another, we can use a dimension reduction technique such as Principal Component Analysis (PCA), which groups features according to their similarities across instances and combines them into fewer dimensions.

6 Principal Component Analysis

6.1 Biplot

# Encode character columns as factors, then as integer codes, so every column is numeric
# (note: integer codes impose an arbitrary ordering on categorical variables)
data_prepared <- data_cleaned %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  scale()  # Standardizes all columns, including the converted factors

pca_results <- PCA(data_prepared, graph = FALSE)
summary(pca_results)

Call:
PCA(X = data_prepared, graph = FALSE) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
Variance               4.844   3.900   2.106   1.222   0.993   0.866   0.855
% of var.             26.913  21.666  11.700   6.790   5.519   4.811   4.750
Cumulative % of var.  26.913  48.579  60.279  67.069  72.588  77.399  82.149
                       Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
Variance               0.827   0.725   0.539   0.460   0.309   0.179   0.131
% of var.              4.595   4.028   2.996   2.557   1.718   0.992   0.729
Cumulative % of var.  86.744  90.773  93.769  96.326  98.044  99.036  99.765
                      Dim.15  Dim.16  Dim.17  Dim.18
Variance               0.034   0.008   0.000   0.000
% of var.              0.188   0.047   0.000   0.000
Cumulative % of var.  99.953 100.000 100.000 100.000

Individuals (the 10 first)
                                 Dist    Dim.1    ctr   cos2    Dim.2    ctr
1                            |  3.335 | -1.044  0.001  0.098 |  0.068  0.000
2                            |  3.410 | -1.208  0.001  0.125 |  0.197  0.000
3                            |  3.544 | -1.448  0.001  0.167 |  0.268  0.000
4                            |  2.789 | -1.245  0.001  0.199 |  0.155  0.000
5                            |  2.742 | -1.166  0.001  0.181 |  0.112  0.000
6                            |  2.742 | -1.166  0.001  0.181 |  0.112  0.000
7                            |  2.855 | -1.129  0.001  0.156 |  0.029  0.000
8                            |  2.903 | -1.317  0.001  0.206 |  0.126  0.000
9                            |  4.943 | -2.152  0.002  0.190 |  0.605  0.000
10                           |  3.448 | -2.115  0.002  0.376 |  0.596  0.000
                               cos2    Dim.3    ctr   cos2  
1                             0.000 | -0.285  0.000  0.007 |
2                             0.003 |  2.415  0.007  0.502 |
3                             0.006 |  2.351  0.006  0.440 |
4                             0.003 |  0.439  0.000  0.025 |
5                             0.002 |  0.407  0.000  0.022 |
6                             0.002 |  0.407  0.000  0.022 |
7                             0.000 |  0.239  0.000  0.007 |
8                             0.002 |  0.285  0.000  0.010 |
9                             0.015 | -0.393  0.000  0.006 |
10                            0.030 |  1.207  0.002  0.123 |

Variables (the 10 first)
                                Dim.1    ctr   cos2    Dim.2    ctr   cos2  
make                         |  0.129  0.345  0.017 | -0.135  0.469  0.018 |
model_year                   |  0.375  2.900  0.141 | -0.003  0.000  0.000 |
vehicle_class                | -0.176  0.643  0.031 |  0.098  0.244  0.010 |
drive                        | -0.048  0.047  0.002 |  0.013  0.004  0.000 |
engine_cylinders             |  0.007  0.001  0.000 |  0.000  0.000  0.000 |
engine_displacement          |  0.025  0.013  0.001 | -0.017  0.007  0.000 |
transmission                 | -0.494  5.028  0.244 |  0.110  0.310  0.012 |
fuel_type_1                  | -0.391  3.149  0.153 |  0.247  1.568  0.061 |
city_mpg_fuel_type_1         |  0.868 15.542  0.753 | -0.397  4.047  0.158 |
highway_mpg_fuel_type_1      |  0.838 14.497  0.702 | -0.409  4.284  0.167 |
                              Dim.3    ctr   cos2  
make                         -0.445  9.407  0.198 |
model_year                   -0.033  0.053  0.001 |
vehicle_class                 0.431  8.813  0.186 |
drive                         0.148  1.044  0.022 |
engine_cylinders              0.772 28.267  0.595 |
engine_displacement           0.877 36.535  0.769 |
transmission                 -0.129  0.795  0.017 |
fuel_type_1                  -0.282  3.771  0.079 |
city_mpg_fuel_type_1         -0.116  0.634  0.013 |
highway_mpg_fuel_type_1      -0.221  2.326  0.049 |
fviz_pca_biplot(pca_results,
                geom.ind = "point",  # To show data points
                geom.var = c("arrow", "text"),  # To show variable vectors and labels
                col.ind = "cos2",  # Color by the quality of representation
                gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),  # Colors
                repel = TRUE  # Avoid text overlapping
)

The biplot conveys several pieces of information:
• The colored dots represent the observations of the dataset.
• The cos2 gradient shows how well each observation is represented by the two plotted dimensions: the higher the cos2 (tending towards red), the better the representation.
• The arrows represent the features, as in a correlation circle. The 2 dimensions shown account for almost 49% of the total variance.
• Looking at the arrows, most variables appear to be strongly linked to dimension 2. Arrows pointing in opposite directions (such as fuel_type_1 and highway_mpg_fuel_type_1) indicate negatively correlated features, whereas fuel_type_1 and fuel_type_2 appear uncorrelated.
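
As a small worked example of how cos2 and the contribution (ctr) relate, the sketch below recomputes both for city_mpg_fuel_type_1 on Dim.1 from the rounded values printed in the PCA summary above. It assumes the usual convention for a standardized PCA, where a variable's cos2 on an axis is its squared coordinate and its contribution is that cos2 divided by the axis eigenvalue. The variable names in the code are illustrative only.

```python
# Values read from the PCA summary above (rounded to three decimals there).
eigenvalue_dim1 = 4.844       # variance (eigenvalue) of Dim.1
coord_city_mpg_dim1 = 0.868   # coordinate of city_mpg_fuel_type_1 on Dim.1

# For a standardized variable, cos2 on an axis is the squared coordinate,
# i.e. the squared correlation between the variable and the axis.
cos2 = coord_city_mpg_dim1 ** 2

# Contribution (in %) of the variable to the axis:
# its cos2 relative to the total variance carried by the axis.
contribution = 100 * cos2 / eigenvalue_dim1

print(round(cos2, 3))
print(round(contribution, 2))
```

Up to rounding of the inputs, the results agree with the cos2 (0.753) and ctr (15.542) reported for this variable in the summary.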

6.2 Screeplot

# Generating the scree plot from PCA results
fviz_eig(pca_results, 
         addlabels = TRUE,  # Adds labels to the plot indicating the percentage of variance
         ylim = c(0, 100),  # Optional: Sets the limits of the y-axis to make the plot easier to interpret
         barfill = "lightblue",  # Color of the bars
         barcolor = "black",  # Color of the borders of bars
         main = "Scree Plot of PCA")  # Title of the plot

According to the scree plot, 6 dimensions are needed to explain at least 75% of the variance (77.4% cumulatively), which suggests that the features are relatively independent. This is already visible in the biplot above: most arrows near the center are short and have a low cos2, meaning that these features are better represented by dimensions other than the first 2. To examine the correlations further, we can use a heatmap.
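
The number of dimensions needed for a given share of variance can be checked directly from the eigenvalues. The following minimal sketch uses the rounded eigenvalues from the PCA summary above, so small rounding differences from the reported percentages are expected. The 75% threshold and the variable names are illustrative choices.

```python
# Eigenvalues (variances) of Dim.1 to Dim.18, copied from the PCA summary above.
eigenvalues = [
    4.844, 3.900, 2.106, 1.222, 0.993, 0.866, 0.855, 0.827, 0.725,
    0.539, 0.460, 0.309, 0.179, 0.131, 0.034, 0.008, 0.000, 0.000,
]

total_variance = sum(eigenvalues)

# Cumulative percentage of variance explained by the first k dimensions.
cumulative_pct = []
running_sum = 0.0
for value in eigenvalues:
    running_sum += value
    cumulative_pct.append(100 * running_sum / total_variance)

# Smallest number of dimensions whose cumulative share reaches the threshold.
threshold = 75.0
n_components = next(
    k + 1 for k, pct in enumerate(cumulative_pct) if pct >= threshold
)

print(n_components)
print(round(cumulative_pct[n_components - 1], 1))
```

The first 5 dimensions stay just under the threshold (about 72.6%), which is why a sixth dimension is required.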

6.3 Heatmap

library(reshape2)

# Assuming data_prepared has been previously defined and standardized
cor_matrix <- cor(data_prepared)  # Calculate correlation matrix

# Melt the correlation matrix for ggplot2
melted_cor_matrix <- reshape2::melt(cor_matrix)  # Explicit namespace: data.table masks melt()
# Heatmap with all correlation coefficients displayed
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +  # Add white lines to distinguish the tiles
  geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3.5) +  # Always display labels
  scale_fill_gradient2(low = "lightblue", high = "darkblue", mid = "blue", midpoint = 0, limit = c(-1,1),
                       name = "Pearson\nCorrelation") +  # cor() defaults to Pearson; gradient2 gives a diverging color scheme
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),  # Center the title
        plot.title.position = "plot") +
  labs(x = 'Variables', y = 'Variables', 
       title = 'Correlations Heatmap of Variables')  # Adjust the title and labels as needed

This heatmap shows the pairwise correlations between the variables. Most correlations are not particularly strong, except for a few pairs such as highway_mpg_fuel_type_1 and city_mpg_fuel_type_1.
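
To make the reading of the heatmap concrete, the sketch below computes Pearson correlation coefficients (the default of `cor()` in R) for a few toy columns and lists the pairs whose absolute correlation exceeds a threshold. The data, the column names and the 0.8 threshold are hypothetical and chosen only for illustration. They are not taken from the project dataset.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical toy columns: two fuel-economy measures and an unrelated one.
columns = {
    "city_mpg": [18, 22, 25, 30, 35, 40],
    "highway_mpg": [25, 29, 33, 38, 44, 49],
    "model_year": [2012, 1995, 2020, 2003, 1999, 2015],
}

# Keep only the pairs whose absolute correlation reaches the threshold.
threshold = 0.8
strong_pairs = []
for name_a, name_b in combinations(columns, 2):
    r = pearson(columns[name_a], columns[name_b])
    if abs(r) >= threshold:
        strong_pairs.append((name_a, name_b, r))

for name_a, name_b, r in strong_pairs:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

With these toy values only the two fuel-economy columns form a strongly correlated pair, which mirrors the pattern described for the heatmap.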